Search Results: Records 1-20 displayed on this page of 62

Journal Articles

Parameter optimization for urban wind simulation using ensemble Kalman filter

Onodera, Naoyuki; Idomura, Yasuhiro; Hasegawa, Yuta; Asahi, Yuichi; Inagaki, Atsushi*; Shimose, Kenichi*; Hirano, Kohin*

Keisan Kogaku Koenkai Rombunshu (CD-ROM), 28, 4 Pages, 2023/05

We have developed a multi-scale wind simulation code named CityLBM that can resolve entire cities down to individual streets. CityLBM enables real-time ensemble simulations of areas several kilometers square by applying the locally mesh-refined lattice Boltzmann method on GPU supercomputers. However, real-world wind simulations involve complex boundary conditions that cannot be fully modeled, so data assimilation techniques are needed to reflect observed data in the simulation. This study proposes an optimization method for the ground surface temperature bias based on an ensemble Kalman filter to reproduce wind conditions within urban city blocks. As a verification of CityLBM, an Observing System Simulation Experiment (OSSE) is conducted for central Tokyo to estimate boundary conditions from observed near-surface temperatures.
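
For context, a minimal sketch of the ensemble Kalman filter (EnKF) analysis step that underlies this kind of parameter estimation; this is the textbook formulation in our own notation, not an excerpt from the paper. Each forecast member $$x^{f}_{i}$$ is updated with a Kalman gain built from the ensemble statistics:

$$x^{a}_{i} = x^{f}_{i} + K\left(y^{o} + \epsilon_{i} - H x^{f}_{i}\right), \qquad K = P^{f} H^{T}\left(H P^{f} H^{T} + R\right)^{-1},$$

where $$P^{f}$$ is the forecast error covariance estimated from the ensemble, $$H$$ the observation operator, $$R$$ the observation error covariance, and $$\epsilon_{i}$$ perturbed observation noise. Estimating a parameter such as a ground surface temperature bias amounts to augmenting the state vector $$x$$ with that parameter so the filter updates it alongside the flow state.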

Journal Articles

CityTransformer; A Transformer-based model for contaminant dispersion prediction in a realistic urban area

Asahi, Yuichi; Onodera, Naoyuki; Hasegawa, Yuta; Shimokawabe, Takashi*; Shiba, Hayato*; Idomura, Yasuhiro

Boundary-Layer Meteorology, 186(3), p.659 - 692, 2023/03

Times Cited Count: 0, Percentile: 0.01 (Meteorology & Atmospheric Sciences)

We develop a Transformer-based deep learning model to predict plume concentrations in an urban area under uniform flow conditions. Our model has two distinct input layers: Transformer layers for sequential data and convolutional layers in convolutional neural networks (CNNs) for image-like data. The model can predict the plume concentration from realistically available data, such as time-series monitoring data at a few observation stations, the building shapes, and the source location. It is shown that the model gives reasonably accurate predictions orders of magnitude faster than CFD simulations. It is also shown that exactly the same model can be applied to predicting the source location, again with reasonable accuracy.
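
For background, the sequential branch of such a model typically relies on scaled dot-product attention; the formula below is the standard one (our addition, not taken from the paper):

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_{k}}}\right) V,$$

where $$Q$$, $$K$$, and $$V$$ are query, key, and value projections of the input time series and $$d_{k}$$ is the key dimension, while the CNN branch treats the building shapes and source location as image-like fields.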

Journal Articles

Data assimilation of three-dimensional turbulent flow using lattice Boltzmann method and local ensemble transform Kalman filter (LBM-LETKF)

Hasegawa, Yuta; Onodera, Naoyuki; Asahi, Yuichi; Idomura, Yasuhiro

Dai-36-Kai Suchi Ryutai Rikigaku Shimpojiumu Koen Rombunshu (Internet), 5 Pages, 2022/12

This study implemented and tested the ensemble data assimilation (DA) of turbulent flows using the lattice Boltzmann method and the local ensemble transform Kalman filter (LBM-LETKF). The computational code was implemented fully on GPUs. The test was carried out for the 3D turbulent flow around a square cylinder with $$2.3\times10^{7}$$ meshes and 32 ensemble members using 32 GPUs. The DA interval in the test was half the period of the Kármán vortex shedding. The normalized mean absolute errors (NMAE) of the lift coefficient were 132%, 148%, and 13.2% for the non-DA case, the nudging case (a simpler DA algorithm), and the LETKF case, respectively. The LETKF achieved good DA accuracy even though the observations were not frequent enough to resolve the small-scale turbulence, while the nudging showed systematic delays in its solution and could not maintain DA accuracy.
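
For context, the LETKF analysis solved in each local region can be summarized as follows (the standard Hunt et al. formulation in our notation, not an excerpt from the paper). With forecast perturbation matrices $$X^{f}$$ (state space) and $$Y^{f}$$ (observation space), ensemble size $$M$$, and observations $$y^{o}$$:

$$\tilde{P}^{a} = \left[(M-1)I + (Y^{f})^{T} R^{-1} Y^{f}\right]^{-1}, \qquad \bar{w}^{a} = \tilde{P}^{a} (Y^{f})^{T} R^{-1}\left(y^{o} - \bar{y}^{f}\right),$$

$$x^{a}_{i} = \bar{x}^{f} + X^{f}\left(\bar{w}^{a} + \left[\sqrt{M-1}\,(\tilde{P}^{a})^{1/2}\right]_{i}\right),$$

where $$[\,\cdot\,]_{i}$$ denotes the $$i$$-th column. Each local problem is independent of the others, which is why the method maps well onto GPUs.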

Journal Articles

Performance portability with C++ parallel algorithm

Asahi, Yuichi; Padioleau, T.*; Latu, G.*; Bigot, J.*; Grandgirard, V.*; Obrejan, K.*

Dai-36-Kai Suchi Ryutai Rikigaku Shimpojiumu Koen Rombunshu (Internet), 8 Pages, 2022/12

We implement a kinetic plasma simulation code with multiple performance-portable frameworks and evaluate its performance on Intel Icelake CPUs, NVIDIA V100 and A100 GPUs, and the AMD MI100 GPU. Relying on the language-standard parallelism stdpar and the proposed language-standard multi-dimensional array support mdspan, we demonstrate a performance-portable implementation without harming readability or productivity. With stdpar, we obtain good overall performance for a kinetic plasma mini-application, within $$\pm$$20% of the Kokkos version on Icelake, V100, A100, and MI100. We conclude that stdpar can be a good candidate for developing performance-portable and productive code targeting Exascale-era platforms, assuming this programming model will be available on AMD and/or Intel GPUs in the future.
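
A minimal sketch of what stdpar code looks like, in the spirit of the approach above (our illustration, not code from the paper). A SAXPY-like kernel is expressed with a standard algorithm plus an execution policy; with NVIDIA's nvc++ -stdpar the same source offloads to the GPU, and it runs multithreaded on CPUs:

    #include <algorithm>
    #include <execution>
    #include <vector>

    int main() {
      const std::size_t n = 1 << 20;
      std::vector<double> x(n, 1.0), y(n, 2.0);
      const double a = 0.5;

      // y = a * x + y; parallelism comes from the policy, no vendor API.
      std::transform(std::execution::par_unseq,
                     x.begin(), x.end(), y.begin(), y.begin(),
                     [a](double xi, double yi) { return a * xi + yi; });
      return 0;
    }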

Journal Articles

Performance portable Vlasov code with C++ parallel algorithm

Asahi, Yuichi; Padioleau, T.*; Latu, G.*; Bigot, J.*; Grandgirard, V.*; Obrejan, K.*

Proceedings of 2022 International Workshop on Performance, Portability, and Productivity in HPC (P3HPC) (Internet), p.68 - 80, 2022/11

Times Cited Count: 0, Percentile: 0 (Computer Science, Theory & Methods)

This paper presents a performance-portable implementation of a kinetic plasma simulation code with C++ parallel algorithms, running across multiple CPUs and GPUs. Relying on the language-standard parallelism stdpar and the proposed language-standard multi-dimensional array support mdspan, we demonstrate that a performance-portable implementation is possible without harming readability or productivity. We obtain good overall performance for a mini-application, within 20% of the Kokkos version on Intel Icelake, NVIDIA V100, and A100 GPUs. Our conclusion is that stdpar can be a good candidate for developing performance-portable and productive code targeting Exascale-era platforms, assuming this approach will be available on AMD and/or Intel GPUs in the future.
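
To complement the stdpar sketch above, here is a minimal illustration of mdspan, the multi-dimensional array support the paper relies on (since standardized in C++23; our example, not the paper's code):

    #include <mdspan>   // C++23; earlier compilers: <experimental/mdspan>
    #include <vector>

    int main() {
      const std::size_t nx = 64, ny = 32;
      std::vector<double> buf(nx * ny, 0.0);

      // Non-owning 2D view over the flat buffer (row-major by default).
      std::mdspan<double, std::dextents<std::size_t, 2>> f(buf.data(), nx, ny);

      for (std::size_t i = 0; i < f.extent(0); ++i)
        for (std::size_t j = 0; j < f.extent(1); ++j)
          f[i, j] = static_cast<double>(i + j);   // C++23 multi-index operator[]
      return 0;
    }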

Journal Articles

Performance measurement of an urban wind simulation code with the Locally Mesh-Refined Lattice Boltzmann Method over NVIDIA and AMD GPUs

Asahi, Yuichi; Onodera, Naoyuki; Hasegawa, Yuta; Shimokawabe, Takashi*; Shiba, Hayato*; Idomura, Yasuhiro

Keisan Kogaku Koenkai Rombunshu (CD-ROM), 27, 5 Pages, 2022/06

We have ported the GPU-accelerated lattice Boltzmann method code CityLBM to the AMD MI100 GPU. We present the performance of CityLBM achieved on NVIDIA P100, V100, and A100 GPUs and the AMD MI100 GPU. Using host-to-host MPI communications, the performance on the MI100 GPU is around 20% better than on the V100 GPU. It turned out that most of the kernels are successfully accelerated, except for the interpolation kernels of the adaptive mesh refinement (AMR) method.
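
The abstract does not state how the port was carried out, but a common pattern for keeping a single source tree across NVIDIA and AMD GPUs is to alias the two runtimes behind thin macros (a generic sketch, not CityLBM code); AMD's hipify tools automate much of this translation:

    #if defined(__HIP_PLATFORM_AMD__)
      #include <hip/hip_runtime.h>
      #define gpuMalloc            hipMalloc
      #define gpuMemcpy            hipMemcpy
      #define gpuDeviceSynchronize hipDeviceSynchronize
    #else
      #include <cuda_runtime.h>
      #define gpuMalloc            cudaMalloc
      #define gpuMemcpy            cudaMemcpy
      #define gpuDeviceSynchronize cudaDeviceSynchronize
    #endif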

Journal Articles

GPU optimization of lattice Boltzmann method with local ensemble transform Kalman filter

Hasegawa, Yuta; Imamura, Toshiyuki*; Ina, Takuya; Onodera, Naoyuki; Asahi, Yuichi; Idomura, Yasuhiro

Proceedings of 13th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Heterogeneous Systems (ScalAH22) (Internet), p.10 - 17, 2022/00

The ensemble data assimilation of computational fluid dynamics simulations based on the lattice Boltzmann method (LBM) and the local ensemble transform Kalman filter (LETKF) is implemented and optimized on a GPU supercomputer based on NVIDIA A100 GPUs. To connect the LBM and LETKF parts, the data transpose communication is optimized by overlapping computation, file I/O, and communication based on the data dependency in each LETKF kernel. In two-dimensional forced isotropic turbulence simulations with an ensemble size of $$M=64$$ and $$N_x=128^2$$ grid points, the optimized implementation achieved a $$\times 3.85$$ speedup over the naive implementation, in which the LETKF part is not parallelized. The main computing kernel of the local problem is the eigenvalue decomposition (EVD) of $$M \times M$$ real symmetric dense matrices, which is computed by a newly developed batched EVD in EigenG. The batched EVD in EigenG outperforms that in cuSolver, achieving a $$\times 64$$ speedup.
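
For reference, the cuSOLVER baseline mentioned above exposes a batched symmetric EVD through its Jacobi solver; the sketch below shows the double-precision call sequence (our illustration of the cuSOLVER API; EigenG's interface differs):

    #include <cuda_runtime.h>
    #include <cusolverDn.h>

    // Eigendecomposition of `batch` independent n-by-n symmetric matrices.
    void batched_evd(double* d_A, double* d_W, int n, int batch) {
      cusolverDnHandle_t handle;  cusolverDnCreate(&handle);
      syevjInfo_t params;         cusolverDnCreateSyevjInfo(&params);

      int lwork = 0;
      cusolverDnDsyevjBatched_bufferSize(handle, CUSOLVER_EIG_MODE_VECTOR,
                                         CUBLAS_FILL_MODE_LOWER, n, d_A, n,
                                         d_W, &lwork, params, batch);
      double* d_work;  int* d_info;
      cudaMalloc(&d_work, sizeof(double) * lwork);
      cudaMalloc(&d_info, sizeof(int) * batch);

      // Eigenvalues land in d_W; eigenvectors overwrite d_A per batch entry.
      cusolverDnDsyevjBatched(handle, CUSOLVER_EIG_MODE_VECTOR,
                              CUBLAS_FILL_MODE_LOWER, n, d_A, n,
                              d_W, d_work, lwork, d_info, params, batch);

      cudaFree(d_work);  cudaFree(d_info);
      cusolverDnDestroySyevjInfo(params);
      cusolverDnDestroy(handle);
    }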

Journal Articles

Tree cutting approach for domain partitioning on forest-of-octrees-based block-structured static adaptive mesh refinement with lattice Boltzmann method

Hasegawa, Yuta; Aoki, Takayuki*; Kobayashi, Hiromichi*; Idomura, Yasuhiro; Onodera, Naoyuki

Parallel Computing, 108, p.102851_1 - 102851_12, 2021/12

Times Cited Count: 2, Percentile: 32.94 (Computer Science, Theory & Methods)

An aerodynamics simulation code based on the lattice Boltzmann method (LBM) with forest-of-octrees-based block-structured local mesh refinement (LMR) was implemented, and its performance was evaluated on GPU-based supercomputers. We found that the conventional space-filling-curve-based (SFC) domain partitioning algorithm results in costly halo communication in our aerodynamics simulations. Our new tree cutting approach improved the locality and the topology of the partitioned sub-domains and reduced the communication cost to between one-third and one-fourth of that of the original SFC approach. In the strong scaling test, the code achieved a maximum $$\times 1.82$$ speedup at a performance of 2207 MLUPS (mega-lattice updates per second) on 128 GPUs. In the weak scaling test, the code achieved 9620 MLUPS on 128 GPUs with 4.473 billion grid points, with a parallel efficiency of 93.4% from 8 to 128 GPUs.
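
As background on SFC partitioning: a forest-of-octrees code typically orders its leaf blocks along a Morton (Z-order) curve and splits the ordered list evenly across ranks, which is exactly the locality the tree cutting approach improves on. A generic 3D Morton encoder looks like this (our illustration, not the paper's code):

    #include <cstdint>

    // Spread the low 21 bits of x to every third bit position.
    static std::uint64_t spread3(std::uint64_t x) {
      x &= 0x1fffff;
      x = (x | x << 32) & 0x1f00000000ffffULL;
      x = (x | x << 16) & 0x1f0000ff0000ffULL;
      x = (x | x <<  8) & 0x100f00f00f00f00fULL;
      x = (x | x <<  4) & 0x10c30c30c30c30c3ULL;
      x = (x | x <<  2) & 0x1249249249249249ULL;
      return x;
    }

    // Interleave block coordinates (i, j, k); sorting blocks by this key
    // yields the Z-order traversal used for SFC domain partitioning.
    std::uint64_t morton3d(std::uint32_t i, std::uint32_t j, std::uint32_t k) {
      return spread3(i) | (spread3(j) << 1) | (spread3(k) << 2);
    }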

Journal Articles

Optimization strategy for a performance portable Vlasov code

Asahi, Yuichi; Latu, G.*; Bigot, J.*; Grandgirard, V.*

Proceedings of 2021 International Workshop on Performance, Portability, and Productivity in HPC (P3HPC) (Internet), p.79 - 91, 2021/11

This paper presents optimization strategies dedicated to a kinetic plasma simulation code that uses OpenACC/OpenMP directives and the Kokkos performance-portability framework to run across multiple CPUs and GPUs. We evaluate the impact of the optimizations on multiple hardware platforms: Intel Xeon Skylake, Fujitsu Arm A64FX, and NVIDIA Tesla P100 and V100. After the optimizations, the OpenACC/OpenMP version achieved speedups of 1.07 to 1.39, and the Kokkos version achieved speedups of 1.00 to 1.33. Since the impact of the optimizations is demonstrated over multiple combinations of kernels, devices, and parallel implementations, this paper provides a broadly applicable approach to accelerating a code while keeping performance portability. To achieve excellent performance on both CPUs and GPUs, Kokkos can be a reasonable choice, offering more flexibility to manage multiple data and loop structures with a single codebase.
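
For readers unfamiliar with the single-source model referred to above, a minimal Kokkos kernel looks like this (our illustration, not the paper's code); the same parallel_for compiles to OpenMP threads on CPUs or CUDA kernels on GPUs depending on the configured backend:

    #include <Kokkos_Core.hpp>

    int main(int argc, char* argv[]) {
      Kokkos::initialize(argc, argv);
      {
        const int n = 1 << 20;
        Kokkos::View<double*> x("x", n), y("y", n);  // device-resident arrays

        Kokkos::parallel_for("axpy", n, KOKKOS_LAMBDA(const int i) {
          y(i) += 0.5 * x(i);
        });
        Kokkos::fence();  // wait for the asynchronous kernel
      }
      Kokkos::finalize();
      return 0;
    }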

Journal Articles

Improved domain partitioning on tree-based mesh-refined lattice Boltzmann method

Hasegawa, Yuta; Aoki, Takayuki*; Kobayashi, Hiromichi*; Idomura, Yasuhiro; Onodera, Naoyuki

Keisan Kogaku Koenkai Rombunshu (CD-ROM), 26, 6 Pages, 2021/05

We introduce an improved domain partitioning method called the "tree cutting approach" for an aerodynamics simulation code based on the lattice Boltzmann method (LBM) with forest-of-octrees-based local mesh refinement (LMR). The conventional domain partitioning algorithm based on the space-filling curve (SFC), which is widely used in LMR, causes costly halo data communication that became a bottleneck of our aerodynamics simulations on GPU-based supercomputers. Our tree cutting approach adopts hybrid domain partitioning, with a coarse structured block decomposition and SFC partitioning within each block. This hybrid approach improved the locality and the topology of the partitioned sub-domains and reduced the amount of halo communication to one-third of that of the original SFC approach. The code achieved a $$\times 1.23$$ speedup on 8 GPUs, and a $$\times 1.82$$ speedup at a performance of 2207 MLUPS (mega-lattice updates per second) on 128 GPUs in the strong scaling test.

Journal Articles

Acceleration of locally mesh allocated Poisson solver using mixed precision

Onodera, Naoyuki; Idomura, Yasuhiro; Hasegawa, Yuta; Shimokawabe, Takashi*; Aoki, Takayuki*

Keisan Kogaku Koenkai Rombunshu (CD-ROM), 26, 3 Pages, 2021/05

We develop a mixed-precision preconditioner for the pressure Poisson equation in the two-phase flow CFD code JUPITER-AMR. The multigrid (MG) preconditioner is constructed based on the geometric MG method with a three-stage V-cycle and a cache-reuse SOR (CR-SOR) method at each stage. Numerical experiments are conducted for two-phase flows in a fuel bundle of a nuclear reactor. The single-precision MG-CG solver shows the same convergence history as the double-precision version while requiring about 75% of its computational time. In the strong scaling test, the single-precision MG-CG solver is accelerated by 1.88 times between 32 and 96 GPUs.
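
The core idea can be sketched in a few lines (a generic illustration under our own simplifications, not the JUPITER-AMR code): the outer CG iteration stays in double precision, while the preconditioner application runs entirely in single precision, roughly halving its memory traffic:

    #include <cstddef>
    #include <vector>

    // Placeholder single-precision smoother: a few Gauss-Seidel-like sweeps
    // on a 1D Poisson stencil, standing in for the CR-SOR V-cycle stages.
    void smooth_float(std::vector<float>& z, const std::vector<float>& r) {
      const std::size_t n = z.size();
      for (int sweep = 0; sweep < 3; ++sweep)
        for (std::size_t i = 1; i + 1 < n; ++i)
          z[i] = 0.5f * (z[i - 1] + z[i + 1] + r[i]);
    }

    // z = M^{-1} r: demote to float, smooth, promote back to double.
    void apply_preconditioner(std::vector<double>& z, const std::vector<double>& r) {
      std::vector<float> rf(r.begin(), r.end());
      std::vector<float> zf(r.size(), 0.0f);
      smooth_float(zf, rf);
      z.assign(zf.begin(), zf.end());
    }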

Journal Articles

Multi-resolution steady flow prediction with convolutional neural networks

Asahi, Yuichi; Hatayama, Sora*; Shimokawabe, Takashi*; Onodera, Naoyuki; Hasegawa, Yuta; Idomura, Yasuhiro

Keisan Kogaku Koenkai Rombunshu (CD-ROM), 26, 4 Pages, 2021/05

We develop a convolutional neural network model to predict multi-resolution steady flow. Based on the state-of-the-art image-to-image translation model Pix2PixHD, our model can predict the high-resolution flow field from the signed distance function. By patching the high-resolution data, the memory requirements of our model are reduced compared to Pix2PixHD.
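
For reference, the signed distance function used as the network input encodes the geometry as (one common convention; our addition, not the paper's notation)

$$\phi(\mathbf{x}) = \pm \min_{\mathbf{y} \in \partial\Omega} \lVert \mathbf{x} - \mathbf{y} \rVert,$$

with opposite signs inside and outside the solid region $$\Omega$$ bounded by $$\partial\Omega$$. Unlike a binary obstacle mask, $$\phi$$ varies smoothly away from walls, which tends to suit convolutional models.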

Journal Articles

GPU acceleration of multigrid preconditioned conjugate gradient solver on block-structured Cartesian grid

Onodera, Naoyuki; Idomura, Yasuhiro; Hasegawa, Yuta; Yamashita, Susumu; Shimokawabe, Takashi*; Aoki, Takayuki*

Proceedings of International Conference on High Performance Computing in Asia-Pacific Region (HPC Asia 2021) (Internet), p.120 - 128, 2021/01

Times Cited Count: 0, Percentile: 0.01 (Computer Science, Hardware & Architecture)

We develop a multigrid preconditioned conjugate gradient (MG-CG) solver for the pressure Poisson equation in the two-phase flow CFD code JUPITER. The MG preconditioner is constructed based on the geometric MG method with a three-stage V-cycle, and an RB-SOR smoother and its variant with cache-reuse optimization (CR-SOR) are applied at each stage. Numerical experiments are conducted for two-phase flows in a fuel bundle of a nuclear reactor. The MG-CG solvers with the RB-SOR and CR-SOR smoothers reduce the number of iterations to less than 15% and 9% of the original preconditioned CG method, leading to 3.1- and 5.9-times speedups, respectively. The obtained performance indicates that the MG-CG solver designed for the block-structured grid is highly efficient and enables large-scale simulations of two-phase flows on GPU-based supercomputers.
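
As background on the RB-SOR smoother named above: grid points are colored like a checkerboard so that all points of one color update independently, which makes the sweep parallel- and GPU-friendly. A 2D sketch (our illustration, not the JUPITER kernel):

    #include <cstddef>
    #include <vector>

    void rbsor_sweep(std::vector<double>& u, const std::vector<double>& f,
                     std::size_t nx, std::size_t ny, double h, double omega) {
      for (std::size_t color = 0; color < 2; ++color)     // 0: red, 1: black
        for (std::size_t j = 1; j + 1 < ny; ++j)
          for (std::size_t i = 1; i + 1 < nx; ++i) {
            if ((i + j) % 2 != color) continue;
            const std::size_t id = j * nx + i;
            const double gs = 0.25 * (u[id - 1] + u[id + 1] +
                                      u[id - nx] + u[id + nx] - h * h * f[id]);
            u[id] = (1.0 - omega) * u[id] + omega * gs;   // SOR relaxation
          }
    }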

Journal Articles

Performance portable implementation of a kinetic plasma simulation mini-app with a higher level abstraction and directives

Asahi, Yuichi; Latu, G.*; Bigot, J.*; Grandgirard, V.*

Proceedings of Joint International Conference on Supercomputing in Nuclear Applications + Monte Carlo 2020 (SNA + MC 2020), p.218 - 224, 2020/10

Performance portability is expected to be a critical issue in the upcoming exascale era. We explore a performance-portable approach for a fusion plasma turbulence simulation code employing the kinetic model, namely the GYSELA code. For this purpose, we extract the key features of GYSELA, such as its high dimensionality (more than 4D) and its semi-Lagrangian scheme, and encapsulate them into a mini-application which solves a similar but simplified Vlasov-Poisson system. We implement the mini-app with OpenACC, OpenMP 4.5, and Kokkos, while suppressing unnecessary duplication of code lines. Based on our experience, we discuss the advantages and disadvantages of OpenACC, OpenMP 4.5, and Kokkos from the viewpoints of performance portability, readability, and productivity.
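
To make the comparison concrete, the same loop offloaded with the two directive models looks as follows (a generic sketch, not the mini-app's code):

    // OpenACC version.
    void scale_acc(int n, double a, double* x) {
      #pragma acc parallel loop copy(x[0:n])
      for (int i = 0; i < n; ++i) x[i] *= a;
    }

    // OpenMP 4.5 target-offload version of the same kernel.
    void scale_omp(int n, double a, double* x) {
      #pragma omp target teams distribute parallel for map(tofrom: x[0:n])
      for (int i = 0; i < n; ++i) x[i] *= a;
    }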

Journal Articles

Ensemble wind simulations using a mesh-refined lattice Boltzmann method on GPU-accelerated systems

Hasegawa, Yuta; Onodera, Naoyuki; Idomura, Yasuhiro

Proceedings of Joint International Conference on Supercomputing in Nuclear Applications + Monte Carlo 2020 (SNA + MC 2020), p.236 - 242, 2020/10

The wind conditions and plume dispersion in urban areas are strongly affected by buildings and plants, which are hardly described in conventional mesoscale simulations. To resolve this issue, we developed a GPU-based CFD code using a mesh-refined lattice Boltzmann method (LBM), which enables real-time plume dispersion simulations with a resolution of several meters. However, such high-resolution simulations are highly turbulent, and the time histories of the results are sensitive to various simulation conditions. In order to improve the reliability of such chaotic simulations, we developed an ensemble simulation approach, which enables a statistical estimation of the uncertainty. We examined the developed code against the field experiment JU2003 in Oklahoma City. In the comparison, the wind conditions showed good agreement, and the average values of the tracer gas concentration satisfied factor-2 agreement between the ensemble simulation data and the experiment.
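
For reference, factor-2 agreement refers to the FAC2 metric that is standard in dispersion model evaluation (our formulation): with paired predictions $$C_{p,i}$$ and observations $$C_{o,i}$$,

$$\mathrm{FAC2} = \frac{1}{N}\sum_{i=1}^{N} n_{i}, \qquad n_{i} = \begin{cases} 1 & 0.5 \le C_{p,i}/C_{o,i} \le 2 \\ 0 & \text{otherwise,} \end{cases}$$

so a prediction counts as a hit when it lies within a factor of two of the measured value.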

Journal Articles

GPU-acceleration of locally mesh allocated two phase flow solver for nuclear reactors

Onodera, Naoyuki; Idomura, Yasuhiro; Ali, Y.*; Yamashita, Susumu; Shimokawabe, Takashi*; Aoki, Takayuki*

Proceedings of Joint International Conference on Supercomputing in Nuclear Applications + Monte Carlo 2020 (SNA + MC 2020), p.210 - 215, 2020/10

This paper presents a GPU-based Poisson solver on a block-based adaptive mesh refinement (block-AMR) framework. The block-AMR method is essential for GPU computation and for an efficient description of the nuclear reactor geometry. In this paper, we successfully implement a conjugate gradient method with a state-of-the-art multi-grid preconditioner (MG-CG) on the block-AMR framework. GPU kernel performance was measured on the GPU-based supercomputer TSUBAME3.0. The vector-vector sum, matrix-vector product, and dot product kernels in the CG solver achieved good performance, at about 60% of the peak memory bandwidth. In the MG kernel, the smoothers of the three-stage V-cycle MG method are implemented using a mixed-precision RB-SOR method, which also performed well. For a large-scale Poisson problem with $$453.0 \times 10^6$$ cells, the developed MG-CG method reduced the number of iterations to less than 30% and achieved a $$\times 2.5$$ speedup compared with the original preconditioned CG method.

Journal Articles

GPU-acceleration of locally mesh allocated Poisson solver

Onodera, Naoyuki; Idomura, Yasuhiro; Ali, Y.*; Shimokawabe, Takashi*; Aoki, Takayuki*

Keisan Kogaku Koenkai Rombunshu (CD-ROM), 25, 4 Pages, 2020/06

We have developed the stencil-based CFD code JUPITER for simulating three-dimensional multiphase flows. A GPU-accelerated Poisson solver based on the preconditioned conjugate gradient (P-CG) method with a multigrid preconditioner was developed for JUPITER with a block-structured AMR mesh. All Poisson kernels were implemented in CUDA, and the GPU kernel functions are well tuned to achieve high performance on GPU supercomputers. The developed multigrid solver shows good convergence, requiring about 1/7 as many iterations as the original P-CG method, and a $$\times 3$$ speedup is achieved in the strong scaling test from 8 to 216 GPUs on TSUBAME 3.0.

Journal Articles

A Large-scale aerodynamics study on bicycle racing

Aoki, Takayuki*; Hasegawa, Yuta

Jidosha Gijutsu, 74(4), p.18 - 23, 2020/04

Aerodynamics studies of bicycle racing have been carried out using CFD simulations based on an LES model. For a single cyclist and groups of 2-4 cyclists, the computed drag forces are in good agreement with wind-tunnel experiments. Different formations of group riding and two competing teams are studied. A large-scale computation for a group of 72 cyclists has been performed using 2.23 billion meshes on a GPU supercomputer.

Journal Articles

Inner and outer-layer similarity of the turbulence intensity profile over a realistic urban geometry

Inagaki, Atsushi*; Wangsaputra, Y.*; Kanda, Manabu*; Yücel, M.*; Onodera, Naoyuki; Aoki, Takayuki*

SOLA (Scientific Online Letters on the Atmosphere) (Internet), 16, p.120 - 124, 2020/00

Times Cited Count: 1, Percentile: 4.56 (Meteorology & Atmospheric Sciences)

The similarity of the turbulence intensity profile under inner-layer and outer-layer scalings was examined for an urban boundary layer using numerical simulations. The simulations consider a developing neutral boundary layer over realistic building geometry. The computational domain covers 19.2 km by 4.8 km and extends up to a height of 1 km with 2-m grids. Several turbulence intensity profiles are defined locally in the computational domain. The inner- and outer-layer scalings work well in reducing the scatter of the turbulence intensity within the inner and outer layers, respectively, regardless of the surface geometry. Although the main scatter among the scaled profiles is attributed to the mismatch between the parts of the layer and the scaling parameters, its behavior can also be explained by introducing a non-dimensional parameter consisting of the ratio of the length or velocity scales.

Journal Articles

Implementation and performance evaluation of a communication-avoiding GMRES method for stencil-based code on GPU cluster

Matsumoto, Kazuya*; Idomura, Yasuhiro; Ina, Takuya*; Mayumi, Akie; Yamada, Susumu

Journal of Supercomputing, 75(12), p.8115 - 8146, 2019/12

Times Cited Count: 2, Percentile: 24.73 (Computer Science, Hardware & Architecture)

A communication-avoiding generalized minimum residual method (CA-GMRES) is implemented on a hybrid CPU-GPU cluster, targeting performance acceleration of the iterative linear system solver in the gyrokinetic toroidal five-dimensional Eulerian code GT5D. In addition to the CA-GMRES, we implement and evaluate a modified variant of CA-GMRES (M-CA-GMRES) proposed in our previous study to reduce the amount of floating-point calculation. This study demonstrates that the beneficial features of the CA-GMRES are its minimum number of collective communications and its highly efficient calculations based on dense matrix-matrix operations. The performance evaluation is conducted on the Reedbush-L GPU cluster, which contains four NVIDIA Tesla P100 GPUs per compute node. The evaluation results show that the M-CA-GMRES is 1.09x, 1.22x, and 1.50x faster than the CA-GMRES, the generalized conjugate residual method (GCR), and the GMRES, respectively, when 64 GPUs are used.
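
As background (a summary of the standard s-step idea in our notation, not details from the paper): CA-GMRES generates $$s$$ Krylov basis vectors at once with a matrix powers kernel,

$$V_{s} = \left[v, \; Av, \; A^{2}v, \; \dots, \; A^{s}v\right],$$

and orthogonalizes the whole block with a single tall-skinny QR factorization. Standard GMRES requires global reductions (collective communications) at every iteration for its Gram-Schmidt steps; the s-step formulation needs only $$O(1)$$ collectives per $$s$$ iterations, trading them for dense matrix-matrix operations that run efficiently on GPUs.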
